Shreya Chowdhury Shopify Winter 2022 Data Science Internship Challenge

Question 1: Given some sample data, write a program to answer the following:

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of 3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

a. Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

Looking at the data set and considering the incredibly high calculated value of 3145.13, my guess would be that in attempting to calculate the average order value, the mean of the order_amount column was taken.

Looking at the summary statistics for the dataset, my suspicion is confirmed. We see that $3145.13 is the mean of the the order amount column. We can also see that the standard deviation for the order amount is incredibly high, $41,282.54 so the data must be spread out. We know that the mean is very affected by outliers, so it's likely that there are a few order amounts that are much higher than the majority of them skewing the data. We can inspect this data further.

We see that the max order amount is $704,000, which is incredibly high. Let's take a look at some of the the higher order_amounts; the top 20 for now.

Looking at this, we see that the highest order amounts are all the same value of $704,000 at the same shop, purchased by the same user. This could be caused by either a really high priced sneaker or the high number of items purchased.

If we calculate the value of each individual sneaker at shop 53 we get a price of $352, a relatively reasonable value. When we take the mean of the average sneaker price, it is \$387.74, a value close to $352. This means the large order amount can be attributed to the large total number of items. From the summary statistics for the total items, we see that the average number of items is roughly 8 with a high standard deviation of 116. The median statistic of total items ordered is 2. This is significantly lower than 2000, the amount user 607 was purchasing. All of this suggests we should consider looking at a statistic that is less affected by the mean to describe this data. We can evaluate some visualizations for more insight.

To reduce the long tail, we can look at the density plot in a log scale.

According to the density plot of the order amount,there are few orders worth hundreds of thousands of dollars. These are outliers, because the majority of the orders are worth less than $50,000. In the log-scale plot, we confirm this by seeing the order amounts with the highest densities (most common) are between \$100 to $1000, values which are much more reflective of affordable sneaker prices. To get a better idea of these outliers, we can look at some box-and-whisker plots.

While we can see the outliers clearly here, we can adjust the plot with limits to get a better idea of the quartiles.

b. What metric would you report for this dataset?

Upon closer inspection we see that there are a lot of outliers above the price of $750. This is likely dragging the mean up. Excluding these values, we see that the majority of the distribution has lower values between \$150 and $400. A better metric to report for this dataset would be the median value, as it is not as dependent on all the values in the dataset and much less susceptible to outliers. In cases such as the one above with extreme outlying values, the median is a better measure of the central tendency than the mean and more reflective of the distribution.

c. What is its value?

The median value or value of the 50th percentile of the dataset is $284.00.

a. How many orders were shipped by Speedy Express in total?

Speedy Express shipped a total of 54 orders.

b. What is the last name of the employee with the most orders?

With the query above, we get the number of orders per employee in descending order. If we limit the table by 1, we get only the top instance, with the employee with the most orders.

The last name of the employee with the most orders is Peacock, with a total of 40 orders.

c. What product was ordered the most by customers in Germany?

To see the product ordered most by customers in Germany, we need to join Orders, Customers,OrderDetails, and Products all together. From there, we can filter the Country so we just get the orders for Germany. After that we can group by the product name, and only select the necessary columns such as Country, ProductName, and calculate a new column with the total quantity of the product being ordered the most.

Once we have that, we see a simplified table with just the Country, ProductName, and TotalOrdered. In order to get the max, or the item ordered the most, we can order the TotalOrdered Column in descending order and then limit the table by 1.

From the query above, we can conclude that the product ordered the most by customers in Germany is the Boston Crab Meat with a total of 160 orders.